Exercise: Implement image captioning app with Gradio

In this exercise, you will walk through the steps to create a web application that generates captions for images using the BLIP model and the Gradio library. Follow the steps below:

Step 1: Set up the environment

  • Make sure you have the necessary libraries installed. Run pip install gradio transformers Pillow to install Gradio, Transformers, and Pillow.
  • Import the required libraries:

Now, let's create a new Python file and call it image_captioning_app.py.

  import gradio as gr
  import numpy as np
  from PIL import Image
  from transformers import AutoProcessor, BlipForConditionalGeneration

Step 2: Load the pretrained model

  • Load the pretrained processor and model:
  processor = # write your code here
  model = # write your code here

Step 3: Define the image captioning function

  • Define the caption_image function that takes an input image and returns a caption:
  def caption_image(input_image: np.ndarray):
      # Convert numpy array to PIL Image and convert to RGB
      raw_image = Image.fromarray(input_image).convert('RGB')
      # Process the image
      # Generate a caption for the image
      # Decode the generated tokens to text and store it into `caption`
      return caption
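Before filling in the blanks, it may help to see what the conversion on the first line of the function actually does. A minimal sketch (using only numpy and Pillow, which are already imported above) of how a Gradio-style numpy array becomes an RGB PIL image:

```python
import numpy as np
from PIL import Image

# gr.Image() passes your function a numpy array (height x width [x channels],
# dtype uint8). A tiny synthetic grayscale "image" stands in for a real upload.
fake_input = np.zeros((4, 4), dtype=np.uint8)

# Same conversion as in caption_image: numpy array -> PIL image, forced to RGB
raw_image = Image.fromarray(fake_input).convert("RGB")

print(raw_image.mode)  # RGB
print(raw_image.size)  # (4, 4)
```

The `.convert('RGB')` step matters because the model's processor expects three-channel input, while uploads may arrive as grayscale or RGBA.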

Step 4: Create the Gradio interface

  • Use the gr.Interface class to create the web app interface:
  iface = gr.Interface(
      fn=caption_image,
      inputs=gr.Image(),
      outputs="text",
      title="Image Captioning",
      description="This is a simple web app for generating captions for images using a trained model."
  )

Step 5: Launch the Web App

  • Start the web app by calling the launch() method:
  iface.launch()
Answer:
  import gradio as gr
  import numpy as np
  from PIL import Image
  from transformers import AutoProcessor, BlipForConditionalGeneration

  # Load the pretrained processor and model
  processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
  model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

  def caption_image(input_image: np.ndarray):
      # Convert numpy array to PIL Image and convert to RGB
      raw_image = Image.fromarray(input_image).convert('RGB')
      # Process the image
      inputs = processor(raw_image, return_tensors="pt")
      # Generate a caption for the image
      out = model.generate(**inputs, max_length=50)
      # Decode the generated tokens to text
      caption = processor.decode(out[0], skip_special_tokens=True)
      return caption

  iface = gr.Interface(
      fn=caption_image,
      inputs=gr.Image(),
      outputs="text",
      title="Image Captioning",
      description="This is a simple web app for generating captions for images using a trained model."
  )

  iface.launch()

Step 6: Run the application

  • Save the complete code to a Python file, for example, image_captioning_app.py.
  • Open a terminal or command prompt, navigate to the directory where the file is located, and run the command:

  python3 image_captioning_app.py


Press Ctrl + C in the terminal to quit the application.

When the app starts, you should see output similar to the following in the terminal:

[Screenshot: terminal output from the running app]

If you are running locally, interact with the web app:

  • The web app starts running and prints a URL in the terminal where you can access the interface.
  • Open that URL in a web browser.
  • You should see an interface with an image upload box.

Congratulations! You have created an image captioning web app using Gradio and the BLIP model. You can further customize the interface, modify the code, or experiment with different models and settings to enhance the application's functionality.